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Abstract 

This paper presents a new method for estimating high dimensional covariance matrices. Our method, 
permuted rank-penaHzed least-squares (PRLS), is based on a Kronecker product series expansion of the 
true covariance matrix. Assuming an i.i.d. Gaussian random sample, we establish high dimensional rates 
of convergence to the true covariance as both the number of samples and the number of variables 
go to infinity. For covariance matrices of low separation rank, our results establish that PRLS has 
significantly faster convergence than the standard sample covariance matrix (SCM) estimator. In addition, 
this framework captures a fundamental tradeoff between estimation error and approximation error, thus 
providing a scalable covariance estimation framework in terms of separation rank, an analog to low rank 
approximation of covariance matrices fTJ. The MSB convergence rates generalize the high dimensional 
rates recently obtained for the ML Flip-flop algorithm Q, lEI for Kronecker product covariance esti- 
mation. We show that a class of block Toeplitz covariance matrices has low separation rank and give 
bounds on the minimal separation rank r that ensures a given level of bias. Simulations are presented 
to validate the theoretical bounds. As a real world application, we illustrate the utility of the proposed 
Kronecker covariance estimator in spatio-temporal linear least squares prediction of multivariate wind 
speed measurements. 

Index Terms 

Structured covariance estimation, penalized least squares, Kronecker product decompositions, high 
dimensional convergence rates, mean-square error, multivariate prediction. 

L Introduction 

Covariance estimation is a fundamental problem in multivariate statistical analysis. It has received 
attention in diverse fields including economics and financial time series analysis (e.g., portfolio selection, 
risk management and asset pricing [4]), bioinformatics (e.g. gene microarray data ||5l, |l6|, functional 
MRI [7]) and machine learning (e.g., face recognition [8|, recommendation systems ^). In many modern 
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applications, data sets are very large with both large number of samples n and dimension d, often with 
d ^ n, leading to a number of covariance parameters that greatly exceeds the number of observations. 
The search for good low-dimensional representations of these data sets has recently led to breakthroughs 
in multivariate statistics and signal processing. Recent examples include sparse covariance estimation fTOl, 
ifTTTl . lfT2l . |fT3l . low rank covariance estimation |[T4ll . |[T5l . |fT6ll . |[T1, and Kronecker product esimation 

El, m, HI, El, 0. 

Kronecker product (KP) structure is a different covariance constraint from sparse or low rank con- 
straints. KP represents apqxpq covariance matrix Sq as the Kronecker product of two lower dimensional 
covariance matrices. When the variables are multivariate Gaussian with covariance following the KP 
model, the variables are said to follow a matrix normal distribution |[T9l . lUTl . ||20|| . This model has 
applications in channel modeling for MIMO wireless communications 11211 . geostatistics Il22l . genomics 
|[23l . multi-task learning |[24ll . face recognition lUl, recommendation systems @ and collaborative filtering 
|[25l . The main difficulty in maximum likelihood estimation of structured covariances is the nonconvex 
optimization problem that arises. Thus, an alternating optimization approach is usually adopted. In the 
case where there is no missing data, an extension of the alternating optimization algorithm of Werner et 
al lITSl . that the authors called the flip flop (FF) algorithm, can be applied to estimate the parameters of 
the Kronecker product model, called KGlasso in ||2l. 

In this paper, we assume that the covariance can be represented as a sum of Kronecker products of 
two lower dimensional factor matrices, where the number of terms in the summation may depend on the 
factor dimensions. More concretely, we assume that there are d = pq variables whose covariance So has 
Kronecker product representation: 

r 

Eo = J]] Ao,7 (g) Bo,7 (1) 

7=1 

where {Ao,-y} are p x p linearly independent matrices and {Bo,^} are qx q linearly independent matrices 
Q We assume that the factor dimensions p, q are known. We note I < r < tq = min(p^, q"^) and refer to 
r as the separation rank. The model ([T]) is analogous to separable approximation of continuous functions 
ll26l . It is evocative of a type of low rank principal component decomposition where the components are 
Kronecker products. However, the components in ([T]) are neither orthogonal nor normalized. The model 
([1]) with separation rank 1 is relevant to channel modeling for MIMO wireless communications, where Aq 
is a transmit covariance matrix and Bq is a receive covariance matrix ||2TI . The model is also relevant 

'Linear independence is understood with respect to the trace inner product defined in the space of symmetric matrices. 
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to other transposable models arising in recommendation systems like NetFlix and in gene expression 
analysis ||9l. The model ^ with r > 1 also finds concrete applications in spatiotemporal MEG/EEG 
covariance modeling Jill, HH, (291, EHI and SAR data analysis ||3l]. We finally note that Van Loan 
and Pitsianis |32] have shown that any pq x pq matrix SIq can be written as an orthogonal expansion 
of Kronecker products of the form ([1]), thus allowing any covariance matrix to be approximated by a 
bilinear decomposition of the form ([T]). 

We propose a convex framework for estimating covariance matrices of the form ([T]). We call our 
method the Permuted Rank-penalized Least Squares (PRLS) estimator. We analyze the convergence rate 
of PRLS in the high dimensional setting. The main contribution of this paper is a convex optimization 
approach to estimating covariance matrices with KP structure of the form ^ and the derivation of tight 
high-dimensional MSB convergence rates as n, p and q go to infinity. For estimating separation rank r 
covariance matrices of the form ([T]), we establish that PRLS achieves high dimensional consistency with 
a convergence rate of Op ^ ''(p +i +i°g"iax(p,g,n)) y rpj^-^ significantly faster than the convergence rate 
Op of the standard sample covariance matrix (SCM). For r = 1 this rate is identical to that of 

the FF algorithm, which fits the sample covariance matrix to a single Kronecker factor. 

The PRLS method for estimating the Kronecker product expansion ([T]) generalizes previously studied 
Kronecker product covariance models liTTl . llT9l to the case of r > 1. This is a fundamentally different 
generalization than the r = 1 sparse KP models proposed in ||9l, 13, IS, |[33l . Independently in Q, 
lO and ll33l . it was established that the high dimensional convergence rate for these sparse KP models 
is of order Op (Pl^lMip^iMi!!) V While we do not pursue the additional sparsity constraint in this 



paper, we speculate that sparsity can be combined with the Kronecker sum model ([T]), achieving improved 
convergence. 

Advantages of the proposed PRLS covariance estimator is illustrated on both simulated and real data. 
The application of PRLS to the NCEP wind dataset shows that a low order Kronecker sum provides 
a remarkably good fit to the spatio-temporal sample covariance matrix: over 86% of all the energy is 
contained in the first Kronecker component of the Kronecker expansion as compared to only 41% in the 
principal component of the standard PCA eigen-expansion. Furthermore, by replacing the SCM in the 
standard linear predictor by our Kronecker sum estimator we demonstrate a 1.9 dB RMSE advantage for 
predicting next-day wind speeds from past measurements from the NCEP network of wind stations. 

The outline of the paper is as follows. Section [n] introduces the notation that will be used throughout 



the paper. Section III introduces the PRLS covariance estimation method. Section IV presents the high 



dimensional MSE convergence rate of PRLS. Section |V] presents numerical experiments. The technical 
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proofs are placed in the Appendix. 

II. Notation 

For a square matrix M, define |M|i = ||vec(M) || and |M|oo = ||vec(M)||^, where vec(M) denotes 
the vectorized form of M (concatenation of columns into a vector). ||M||2 is the spectral norm of M. 
Mjj and [M]j j are the {i,j)th element of M. Let the inverse transformation (from a vector to a matrix) 
be defined as: vec^^(x) = X, where x = vec(X). Define the pq x pq permutation operator Kp g such 
that Kp^qVec(N) = vec(N^) for any p x q matrix N. For a symmetric matrix M, A(M) will denote the 
vector of real eigenvalues of M and define Xmax{^) = ||M||2 = max Aj(M) for p.d. symmetric matrix, 
and Amm(M) = minAj(M). For any matrix M, define the nuclear norm ||M||^ = X][=i I'^K^)!' where 
tm = rank;(M) and cJi(M) is the ^th singular value of M. 

For a matrix M of size pq x pq, let {M(i,j)}^^^^ denote its q x q block submatrices, where each 
block submatrix is M(i,j) = [^][i-i)q+i:iq^(j-i)q+i:jq- Also let {'WL{k,l)}'j^ denote the p x p block 
submatrices of the permuted matrix M = K^^MKp ^. Define the permutation operator TZ : MPi^Pi — ). 
jgp^xg^ by setting the {i — l)p+j row of 7^(M) equal to vec(M(i, j))-^. An illustration of this permutation 
operator is shown in Fig. [T] 



Original Covariance 




Fig. 1. Original (top) and permuted covariance (bottom) matrix. The original covariance is So — Ao ® Bo, where Ao is a 
10 X 10 Toeplitz matrix and Bo is a 20 x 20 unstructured p.d. matrix. Note that the permutation operator TZ maps a symmetric 
p.s.d. matrix So to a non-symmetric rank 1 matrix Ro — 7?.(So). 
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Define the set of symmetric matrices = {A £ MP^^ : A = A-^}, the set of symmetric positive 
semidefinite (psd) matrices S^, and the set of symmetric positive definite (pd) matrices S^_^_. Id is a 
d X d identity matrix. It can be shown that is a convex set, but is not closed |[34l . Note that S^_^_ 
is simply the interior of the closed convex cone 5^^. 

For a subspace U, define Pu and Pjj as the orthogonal projection onto U and f/-*-, respectively. The 
unit Euclidean sphere in M'^' is denoted by S'^'^^ = {x G M.'^' : ||x||2 = 1}. Let (x)^ = max(2;,0). 

Statistical convergence rates will be denoted by the Op(-) notation, which is defined as follows. 
Consider a sequence of real random variables {Xn}neN defined on a probability space (^7, T, P) and a 
deterministic (positive) sequence of reals {6„}neN- By X„ = Op(l) is meant: sup^gj!^ Pr(|X„| > K) — > 
as K ^ oo, where X„ is a sequence indexed by n, for fixed p, q. The notation X„ = Op{bn) is equivalent 
to ^ = Op{l). By Xn = Op(l) is meant Pr(|X„| > e) — ^ as n — > oo for any e > 0. By A„ x 6„ is 
meant ci < < C2 for all n, where ci, C2 > are absolute constants. 

III. Permuted Rank-penalized Least-squares 

Available are n i.i.d. multivariate Gaussian observations {ztjJL^, where G W"^, having zero-mean 
and covariance equal to ([T]). A sufficient statistic for covariance estimation is the well-known sample 
covariance matrix (SCM): 

1 " 

Sn = - Ztzf (2) 

n ^-^ 

t=\ 

A penalized least-squares approach was proposed in 1 1 ] for estimating a low rank covariance matrix by 
solving: 

S^Garg min ||Sn - S||^ + A||S||, (3) 



where A > is a regularization parameter. For A = C^||I]o||2 y , where C" > is large 

enough, and n > cr(5]o) log^(max(2(i, n)) for some constant c > sufficiently large. Cor. 1 in [1] 

j_. 

r(So) log(2d) 



establishes a tight Frobenius norm error bound, which states that with probability 1 — 



||S^ - Sofp < inf ||So - S\\l + C||So|l2rank(S)- 

where r(So) = |^Jp ^ min{rank(So), d} is the effective rank lIH. 

Here we propose a similar nuclear norm penalization approach to estimate low separation-rank covari- 
ance matrices. Motivated by Van Loan and Pitsianis's work |32|, we propose: 

R^earg min ||R„ - R||^ + A||R||^ (4) 
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where = ^(Sn) is the permuted SCM of size x q^. The minimum-norm problem considered in 
133 is: 

min ||R„ — R||^ (5) 

ReRi'^x'J^rank(R)<r 

We note that (|4]) is a convex relaxation of Q and is more amenable to analysis. Furthermore, we show 
a tradeoff between approximation error (i.e., the eiTor induced by model mismatch between the true 
covariance and the model) and estimation enor (i.e., the error due to finite sample size) by analyzing the 
solution of Q. We also note that Q is a strictly convex problem, so there exists a unique solution that 
can be efficiently found using well established methods [34]. 

The closed form solution of (|4]) is given by singular value thresholding (SVT): 



(6) 



where Uj and Vj are the left and right singular vectors of R„. Efficient methods of solving such problems 
have been recently studied in the literature 1351 . l36ll . Although empirically observed to be fast, the 
computational complexity of the algorithms presented in |[35l and l36ll is unknown. For computation of 
the rank r SVD requires on the order 0{p'^q^r) floating point operations (see proof of Thm. [T]). However, 
faster probabilistic-based methods for truncated SVD take 0{p'^q^ log(r)) computational time ||37l . Thus, 
the computational complexity of solving (|4]) scales well with respect to the designed separation rank. 

The next theorem shows that the de-permuted solution is symmetric and positive definite under mild 
conditions. 

Theorem 1. Consider the de-permuted solution = 7^^^(R^). The following are true: 

1) The solution is symmetric. 

2) If n> pq, then the solution is positive definite with probability 1. 

Proof: See Appendix A. ■ 
We believe that the PRLS estimate 51^ is positive definite even \i n < pq for appropriately selected 
A > 0. In our simulations, we always found Xl^ to be positive definite. We also noticed that the condition 
number of the PRLS estimate is order of magnitudes smaller than the SCM. 

IV. High Dimensional Consistency of PRLS 



In this section, we show that RPLS achieves the MSE statistical convergence rate of Op 
This result is clearly superior to the statistical convergence rate of the naive SCM estimator, particularly 
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when p,q ^ oo: 

\\Sn-'So\\l = Op(^^y (7) 

The next result provides a deterministic relation between the spectral norm of R„ — Rq and the 
Frobenius norm of the the estimation error Ri^ — Rn. 



Theorem 2. Consider the convex optimization problem When A > 2||R„ — R0II2. the following 
holds: 

llR-n - R-oIIf < inf I ||R - RoIIf + ^^^/^^ \^rank{Yl)^ (8) 
Proof: See Appendix B. ■ 

A. High Dimensional Operator Norm Bound for Permuted Sample Covariance Matrix 
In this subsection, we establish a tight bound on the spectral norm of the error matrix 

A„ = R,-Ro = 7^(S„-So)• (9) 

The standard strong law of large numbers implies that for fixed dimensions p, q, we have A„ — )• almost 
surely as n — )• 00. The next result will characterize the finite sample fluctuations of this convergence (in 
probability) measured by the spectral norm as a function of the sample size n and factor dimensions 
p, q. This result will be useful for establishing a tight bound on the Frobenius norm convergence rate of 
PRLS and can guide the selection of the regularization paramater in Q. 

Theorem 3. ( Operator Norm Bound on Permuted SCM) Assume ||So||2 < 00 for all p,q and define M = 

max(p, q, n). Fix e' = ^. Assume t > max(y^4Ci ln(l + p-), AC2 ln(l + ^)) and C = max(Ci, C2) > 0. 

t_ 

Then, with probability at least 1 — 2M 4c, 



Cot \p' + q^+\ogM p^ + q^ + logM 

A„ 2 < T o^max<^ ,\ } (10) 

1 — 2e n V n 



for some absolute constant Co > 1^ 

Proof: See Appendix D. ■ 
The proof technique is based on a large deviation inequality, derived in Lemma [2] in Appendix C, that 
characterizes the tail behavior of the quadratic form x^A„y over the spheres x G ^P^-i and y G S'^^^^. 

^The constant in front of the rate can be tightened by optimizing it as a function of e' over the interval (0, 1/2), but is left 
as a finite constant here. 
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Using Lemma |2] and a sphere covering argument, the result of Theorem [3] follows (see Appendix D). 
Fig. |2] empirically validates the tightness of the bound ( [10] ) under the trivial separation rank 1 covariance 



Ij, (g) In. 




Fig. 2. Monte Carlo simulation for growth of spectral norm ||A,i||2 as a function of p for fixed n = 10 and q = 5. The 
predicted curve is a least-square fit of a quadratic model y = ax^ + 6 to the empirical curve, and is a great fit. This example 
shows the tightness of the probabilistic bound iflOl. 



B. High Dimensional MSE Convergence Rate for PRLS 

Using the result in Thm. [3] and the bound in Thm. [2j we next provide a tight bound on the MSE 



estimation error. 



Theorem 4. Define M = max(p,g,n). Set A = A„ = ^ max | P^+9^+i°g , yg±g!±iggii | y^r 
t > large enough (see Eqn. (10)). Then, with probability at least 1 — 2M~ic: 



\\±^,--So\\l< inf ||R-Ro||^ 

R:rank{R)<r 

, { ( p^ + (l + \ogMV p^ + g^ + logM \ 

+ 6 r max < , > (11) 

y\ n J n J 

for some absolute constant C > 0. 

Proof: See Appendix E. ■ 

1 1 2 

When there is no model mismatch the approximation error inf|j^.].an]j(R)<r} ||R— Ro||^ is zero and, as a 
result, in the large-p, q, n asymptotic regime where p^+q^ + logM = o{n), it follows that — So||p^ = 
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Op{y ^^ +<i _ Q^(^iy 'Phis asymptotic MSE convergence rate of the estimated covariance to 

the true covariance reflects the number of degrees of freedom of the model, which is essentially of the 
order of r(p^ + q^) total covariance parameters. This result extends the recent high-dimensional results 
obtained in ||2l, ||3l for the single Kronecker product model (i.e., r = 1). 

Moreover, we note that r < vq = min(p^, q'^). For the case when p ^ q, and r ~ ro, we have a fully 
saturated Kronecker product model and the number of model parameters are of the order ~ d^, and 
the SCM convergence rate (|7]) coincides with the rate obtained in Thm. [4] 

For covariance models of low separation rank-i.e., r <^ ro, Thm. |4] yields that the high dimensional 
MSE convergence rate of PRLS can be much lower than the naive SCM convergence rate. Thus PRLS 
is an attractive alternative to rank-based series expansions like principal component analysis (PCA). We 
note that each term in the expansion Ao,-^ ® Bo,-^ can be full-rank, while each term in the standard PCA 
expansion is rank 1. 

Finally, we observe that Thm. [4] captures the tradeoff between estimation error and approximation error. 
In other words, choosing a smaller r than the true separation rank would incur a larger approximation 
error inf|jt.j.an]i(pj^)<j,| ||R — Ro||^ > 0, but smaller estimation eiTor of the order Op( j'^^ +iog Af) 



n 



C. Approximation Error 

It is well known from least-squares approximation theory that the residual error can be rewritten as: 

irif ||R-Ro||^= ^fc(R-o) (12) 

R:rank(R)<r — 

fc=r+l 

Since we consider the high dimensional setting-i.e. the sample size n grows in addition to the dimensions 



p, q, the maximum separation rank ro also grows to infinity, which makes the approximation error ( 12 1 



infinite. According to Theorem [4j to estimate the covariance matrix in the mean-square sense up to a 



bounded approximation error, we need to ensure that the sum (12) remains finite as g — )• oo. For this 
to occur, the singular values of Rq need to decay fast enough and r needs to be appropriately chosen. 

We show next that the class of block- Toeplitz covariance matrices have bounded approximation error 
if the separation rank scaled like log(max(p, q)). To show this, we first provide a tight variational bound 
on the singular value spectrum of any p^ x matrix R. 

Lemma 1. (Variational Bound on Singular Value Spectrum) Let R be an arbitrary matrix of size p^ x q^. 
Let Y*}^ be an orthogonal projection ofW^^ onto M.^. Then, for k = 1, . . . ,ro — 1 we have: 

a2+i(R)<||(V-Pfc)R^||2 (13) 
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with equality iff = V^, where R = USV^ is the singular value decomposition. 



Proof: See Appendix F. ■ 
Using this fundamental lemma, we can characterize the approximation error for estimating block- 
Toeplitz matrices with exponentially decaying off-diagonal norms. Such matrices arise, for example, as 
covariance matrices of multivariate stationary random processes of dimension mand take the block form: 

I](0) S(l) ... ^{N) 

5](-l) S(0) ... ^{N-l) 



{N+l)mx(N+l)m 



(14) 



5](-iV) 5](-iV + l) ... S(0) 
where each submatrix is of size m x m. For a zero-mean vector process y = {y(0), . . . ,y(A^)}, the 
submatrices are given by X1(t) = E[y(0)y(T)-^]. 

Theorem 5. Consider a block-Toeplitz p.d. matrix of size {N + l)m x {N + l)m, with \\T,{t)\\p < 

C'v?^'^\q for all r = — A^, . . . ,N and constant u G (0, 1). Choose 

\og{pq/e) 

r ^ . 

log(l/u) 

Then, the PRLS algorithm estimates Sq to an absolute tolerance e G (0, 1) with convergence rate 
guarantee: 



< e + r 



p^ + + log M 



n 



(15) 



for appropriately scaled A. 



Proof: See Appendix G. ■ 

The exponential norm decay condition of Thm. [5] is satisfied by a first-order vector autoregressive 
process: 

Zt = '^Zt-i + £t (16) 

with u = ||$||2 G (0, 1), where Zj. For £t ~ A^(0, S^), this is a multivariate Gaussian process. Collecting 
data over a time horizon of size + 1, we concatenate these observations into a large random vector 
z of dimension {N + l)m, where m is the process dimension. The resulting covariance matrix has the 
block-Toeplitz form assumed in Thm. |5] Figure [3] shows bounds constructed using the Frobenius upper 



bound on the spectral norm in ( 13 1 and using the projection matrix as discussed in the proof of Thm. 
[5] The bound given in the proof of Thm. [5] (in black) is shown to be linear in log-scale, thus justifying 
the exponential decay of the Kronecker spectrum. 
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Kronecker Spectrum 



Singular vals. 




5 10 15 20 25 30 35 

k 

Fig. 3. Kronecker spectrum and bounds based on Lemma [T] The upper bound 'Bound - frob' (in green) is obtained using 
the bound ( |13^ using the basis associated with the minimum £2 approximation error (i.e., the optimal basis computed by SVD 
as outlined in the equality condition of Lemma [TJ. The upper bound 'Bound GS - frob' (in magenta) is constructed using the 
variational bound \13\ with projection matrix having columns drawn from the orthonormal basis constructed in the proof 
of Thm. |5] The upper bound 'Bound GS - frob 2' (in black) is constructed from the bound l |39[ > in the proof of Thm. |5] 

V. Simulation Results 

We consider dense positive definite matrices XIq of dimension d = 625. Taking p = g = 25, we note 
that the number of free parameters that describe each Kronecker product is of the order p'^ + q'^ ~ p^, 
which is essentially of the same order as the number of unknown parameters required to specify each 
eigenvector of Sq, i.e., pq ~ p"^. 

A. Sum of Kronecker Product Covariance 

The covariance matrix shown in Fig. |4] was constructed using ([1]) with r = 3, with each p.d. factor 
chosen as CC^, where C is a square Gaussian random matrix. Fig. [s] shows the empirical performance 
of covariance matching (CM) (i.e., solution of ([5]l with r = 3), PRLS and SVT (i.e., solution of ([3])). 
We note that the Kronecker spectrum contains only three nonzero terms while the true covariance is full 
rank. The PRLS spectrum is more concentrated than the eigenspectrum and, from Fig. |5} we observe 
PRLS outperforms covariance matching (CM), SVT and SCM across all n. 
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Fig. 4. Simulation A. True dense covariance is constructed using the sum of KP model l[TJ, with r = 3. Left panel: True 
positive definite covariance matrix Sq. Middle panel: Kronecker spectrum (eigenspectrum of So in permuted domain). Right 
panel: Eigenspectrum (Eigenvalues of So). Note that the &onecker spectrum is much more concentrated than the eigenspectrum. 

'g'6Z2, (p,q)=(25, 25),d = 625 




Sample size (n) 



Fig. 5. Simulation A. Normalized MSE performance for true covariance matrix in Fig. |4] as a function of sample size n. 
PRLS outperforms CM, SVT (i.e., solution of ^) and the standard SCM estimator. Here, p = q = 25 and Nmc = 80. For 
n = 20, PRLS achieves a 7.91 dB MSE reduction over SCM and SVT achieves a 1.80 dB MSE reduction over SCM. 



B. Block Toeplitz Covariance 

The covariance matrix shown in Fig. [6] was constructed by first generating a Gaussian random square 
matrix <^ of spectral norm 0.95 < 1, and then simulating the block Toeplitz covariance for the process 
shown in ( 16 1. Fig. |7] compares the empirical performance of PRLS and SVT (i.e., the solution of ([3]) with 
appropriate scaling for the regularization parameter). We observe that the Kronecker product estimator 
performs much better than both SVT (i.e., the solution of ([3])) and naive SCM estimator. This is most 
likely due to the fact that the repetitive block structure of Kronecker products better summarizes the 
covariance structure. We observe from Fig. |6] that for this block Toeplitz covariance, the Kronecker 
spectrum decays more rapidly (exponentially) than the eigenspectrum. 
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True Covariance MalriK Kronecker Spectrum Eigenspeclrum 




Fig. 6. Simulation B. True dense block-Toeplitz covariance matrix. Left panel: True positive definite covariance matrix Sq. 
Middle panel: Kronecker spectrum (eigenspectrum of So in permuted domain). Right panel: Eigenspeclrum (Eigenvalues of 

So). Note that the Ki^onecker spectrum is much more concentrated than the eigenspectrum. 

r„ = 366, (p,q)=(25, 25), d = 625 




Sample size (n) 

Fig. 7. Simulation B. Normalized MSE performance for covariance matrix in Fig. [6] as a function of sample size n. PRLS 
outperforms SVT (i.e., solution of |3|) and the standard SCM estimator. Here, p = g = 25 and Nmc = 80. For n = 108, 
PRLS achieves a 6.88 dB MSE reduction over SCM and SVT achieves a 0.37 dB MSE reduction over SCM. Note again that 
the Kronecker spectrum is much more concentrated than the eigenspectrum. 

VI. Application to Wind Speed Prediction 

In this section, we demonstrate the performance of PRLS in a real world application: wind speed 
prediction. We apply our methods to the Irish wind speed dataset and the NCEP dataset. 

A. Irish Wind Speed Data 

We use data consisting of time series consisting of daily average wind speed recordings during the 
period 1961 — 1978 at (7 = 11 meteorological stations. This data set has many temporal coordinates, 
spanning a total of ntotai = 365 • 8 = 2920 daily average recordings of wind speed at each station. More 
details on this data set can be found in |[38l . |[39l . BOl . PH and it can be downloaded from Statlib 
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http://lib.stat.cmu.edu/datasets. We used the same square root transformation, estimated seasonal effect 
offset and station-specific mean offset as in |38], yielding the multiple (11) velocity measures. We used 
the data from years 1969 — 1970 for training and the data from 1971 — 1978 for testing. 

The task is to predict the average velocity for the next day using the average wind velocity in each of 
the p—1 previous days. The full dimension of each observation vector is d = pq, and each d-dimensional 
observation vector is formed by concatenating the p time-consecutive g-dimensional vectors (each entry 
containing the velocity measure for each station) without overlapping the time segments. The SCM was 
estimated using data from the training period consisting of years 1969 — 1970. Linear predictors over the 
time series were constructing by using these estimated covariance matrices in an ordinary least squares 
predictor. Specifically, we constructed the SCM linear predictor of all stations' wind velocity from the 
p—1 previous samples of the q = 11 stations' time series: 

vt = S2,iS];}v(_i.t_(p_i) (17) 

where Vi_i.(_(p_i) e ]R(p^^)'^ is the stacked wind velocities from the previous p—1 time instants and 
^2,1 e M«'^'?(P-i) and Em G M9(p-i)x'?(p-i) are submatrices of the qp x qp standard SCM: 

Sl,l 5]l,2 
^2,1 "^2,2 

The PRLS predictor was similarly constructed using our proposed estimator of the qp x qp Kronecker 
sum covariance matrix instead of the SCM. The coefficients of each of these predictors, S2,iS|fp were 
subsequently applied to predict over the test set. 

The predictors were tested on the data from years 1971 — 1978, corresponding to ritest = 365-8 = 2920 
days, as the ground truth. Using non-overlapping samples and p = 8, we have a total of n = [^^^^] = 91 
training samples of full dimension d = 88. 

Fig. [8] shows the Kronecker product factors that make up the solution of Eq. Q with r = 1 and the 
PRLS estimate. The PRLS estimate contains re// = 6 nonzero terms in the KP expansion. 



Fig. 10 shows the root mean squared error (RMSE) prediction performance over the testing period 
of 2920 days for the forecasts based on the standard SCM, PRLS estimator, Tyler's ML estimator B2ll . 
and regularized Tyler Bill . The PRLS estimator was implemented using a regularization parameter A„ = 



^ll'^n||2\/ ^ "'"'^ +i°g(max(p,g,n)) ^^^.j^ ^ _ g jjjg constant C was choscn by optimizing the prediction 
RMSE on the training set over a range of regularization parameters A parameterized by C. The regularized 
Tyler estimator was implemented using the data-dependent shrinkage coefficient suggested in Eqn. (13) 
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Fig. 8. Irish wind speed data: Sample covariance matrix (SCM) (top left), PRLS covariance estimate (top right), temporal 
Kronecker factor for first KP component (bottom left) and spatial Kronecker factor for first KP component (bottom right). 
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Fig. 9. Irish wind speed data: Kronecker spectrum of SCM (left) and Eigenspectrum of SCM (right). The first and second KP 
components contain 94.60% and 1.07% of the spectrum energy. The first and second eigenvectors contain 36.28% and 28.76% 
of the spectrum energy. The KP spectrum is more compact than the eigenspectrum. 



in II421 . Fig. 11 shows a sample period of 150 days. We observe that PRLS tracks the actual wind speed 
better than the SCM-based predictor does. 
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Fig. 10. Irish wind speed data: RMSE prediction performance across q stations for linear estimators using SCM (blue), PRLS 
(green), Tyler's MLE (cyan) and regularized Tyler (magenta). PRLS achieves an average reduction in RMSE of 3.32 dB as 
compared to SCM (averaged across stations) and regularized Tyler achieves an average RMSE reduction of 2.79 dB as compared 
to SCM. 



station 3 




Fig. 11. Irish wind speed data: Prediction performance for linear estimators using SCM (blue) and PRLS (green) for a time 
interval of 150 days. The actual (ground truth) wind speeds are shown in black. PRLS offers better tracking performance as 
compared to SCM. 



B. NCEP Wind Speed Data 

We use data representative of the wind conditions in the lower troposphere (surface data at .995 sigma 
level) for the global grid (90°N - 90°S, 0°E - 357.5°E). We obtained the data from the National Centers 
for Environmental Prediction reanalysis project (Kalnay et al. [43]), available at the NOAA website 
ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface. Daily averages of U (east-west) and V 
(north-south) wind components were collected using a station grid of size 144 x 73 (2.5 degree latitude 
X 2.5 degree longitude global grid) over the years 1948 — 2012. The wind speed is computed by taking 
the magnitude of the wind vector. 

1) Continental US Region: We considered a 10 x 10 grid of stations, corresponding to latitude range 
25°N-47.5°N and longitude range 125°W-97.5°W. For this selection of variables, g = 10 • 10 = 100 is 
the total number of stations and p — 1 = 7 is the prediction time lag. We preprocessed the raw data using 
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the detrending procedure outlined in Haslett et al. II38I . More specifically, we first performed a square 
root transformation, then estimated and subtracted the station-specific means from the data and finally 



estimated and subtracted the seasonal effect (see Fig. 12 1. The resulting features/observations are called 
the velocity measures |38]. The SCM was estimated using data from the training period consisting of 




Fig. 12. NCEP wind speed data (Continental US): Seasonal effect as a function of day of the year. A 14th order polynomial 
is fit by the least squares method to the average of the square root of the daily mean wind speeds over all stations and over all 
training years. 



years 2003 — 2007. Since the SCM is not full rank, the linear preictor ( 17 1 was implemented with the 
Moore-Penrose pseudo-inverse of Xli i. The predictors were tested on the data from years 2008 — 2012 
as the ground truth. Using non-overlapping samples and p = 8, we have a total of n = [^^^] = 228 
training samples of full dimension d = 800. 

Fig. 13 shows the Kronecker product factors that make up the solution of Eq. Q with r = 1 and the 
PRLS covariance estimate. The PRLS estimate contains r^ff = 6 nonzero terms in the KP expansion. 



Fig. 15 shows the root mean squared error (RMSE) prediction performance over the testing period of 
1825 days for the forecasts based on the standard SCM, PRLS, and regularized Tyler The PRLS 
estimator was implemented using a regularization parameter A„ = C'||Sn||2'y^ ^ "^'^ +i°g(™ax(p,(? ^n)) ^-^j^ 
C = 0.036. The constant C was chosen by optimizing the prediction RMSE on the training set over a 



range of regularization parameters A parameterized by C (as in Irish wind speed data set). Fig. 16 shows 
a sample period of 150 days. It is observed that SCM has unstable performance, while the Kronecker 
product estimator offers better tracking of the wind speeds. 

2) Arctic Ocean Region: We considered a 10 x 10 grid of stations, corresponding to latitude range 
90°N-67.5°N and longitude range 0°E-22.5°E. For this selection of variables, g = 10 • 10 = 100 is the 



March 1, 2013 



DRAFT 



18 




200 400 600 800 200 400 600 800 



KP Left Factor: Temporal KP Right Factor: Spatial 




Fig. 13. NCEP wind speed data (Continental US): Sample covariance matrix (SCM) (top left), PRLS covariance estimate 
(top right), temporal Kronecker factor for first KP component (bottom left) and spatial Kronecker factor for first KP component 
(bottom right). 




Fig. 14. NCEP wind speed data (Continental US): Kronecker spectrum of SCM (left) and Eigenspectrum of SCM (right). The 
first and second KP components contain 85.88% and 3.48% of the spectrum energy. The first and second eigenvectors contain 
40.93% and 23.82% of the spectrum energy. The KP spectrum is more compact than the eigenspectrum. 



total number of stations and p — 1 = 7 is the prediction time lag. We preprocessed the raw data using 
the detrending procedure outlined in Haslett et al. |f38l|. More specifically, we first performed a square 
root transformation, then estimated and subtracted the station-specific means from the data and finally 



estimated and subtracted the seasonal effect (see Fig. 17). The resulting features/observations are called 
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Fig. 15. NCEP wind speed data (Continental US): RMSE prediction performance across q stations for linear estimators using 
SCM (blue) and PRLS (green). PRLS achieves an average reduction in RMSE of 1.90 dB as compared to SCM (averaged across 
stations) and regularized Tyler achieves an average RMSE reduction of 0.66 dB as compared to SCM. 
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Fig. 16. NCEP wind speed data (Continental US): Prediction performance for linear estimators using SCM (blue) and PRLS 
(green) for a time interval of 150 days. The actual (ground truth) wind speeds are shown in black. PRLS offers better tracking 
performance as compared to SCM. 



the velocity measures |38]. The SCM was estimated using data from the training period consisting of 



years 2003 — 2007. Since the SCM is not full rank, the linear preictor ( 17 1 was implemented with the 
Moore-Penrose pseudo-inverse of Xli i. The predictors were tested on the data from years 2008 — 2012 
as the ground truth. Using non-overlapping samples and p = 8, we have a total of n = [^^^^] = 228 
training samples of full dimension d = 800. 

Fig. 1 8 shows the Kronecker product factors that make up the solution of Eq. (|5]l with r = 1 and the 
PRLS covariance estimate. The PRLS estimate contains re// = 2 nonzero terms in the KP expansion. 



Fig. 20 shows the root mean squared error (RMSE) prediction performance over the testing period of 
1825 days for the forecasts based on the standard SCM, PRLS, and regularized Tyler Ell. The PRLS 
estimator was implemented using a regularization parameter A„ = (7||Sn||2-y ^ "'"^ +iog(max(p,9 ^7i)) ^^^.j^ 
C = 0.073. The constant C was chosen by optimizing the prediction RMSE on the training set over a 
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Fig. 17. NCEP wind speed data (Arctic Ocean): Seasonal effect as a function of day of the year. A 14th order polynomial 
is fit by the least squares method to the average of the square root of the daily mean wind speeds over all stations and over all 
training years. 



range of regularization parameters A parameterized by C (as in Irish wind speed data set). Fig. 21 shows 



a sample period of 150 days. It is observed that SCM has unstable performance, while the Kronecker 
product estimator offers better tracking of the wind speeds. 

VII. Conclusion 

We have introduced a framework for covariance estimation based on separation rank decompositions 
using a series of Kronecker product factors. We proposed a least-squares estimator in a permuted linear 
space with nuclear norm penalization, named PRLS. We established high dimensional consistency for 
PRLS with guaranteed rates of convergence. The analysis shows that for low separation rank covariance 
models, our proposed method outperforms the standard SCM estimator. For the class of block-Toeplitz 
matrices with exponentially decaying off-diagonal norms, we showed that the separation rank is small, 
and specialized our convergence bounds to this class. We also presented synthetic simulations that showed 
the benefits of our methods. 

We emphasize that our analysis is in general non-asymptotic, in the sense that probabilistic bounds are 
derived that hold with a certain probability and this probability becomes higher as the number of sample 
and/or variables tend to infinity. 

As a real world application we demonstrated the performance of the proposed Kronecker product-based 
estimator in wind speed prediction using an Irish wind speed dataset and a recent US NCEP dataset. 
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Fig. 18. NCEP wind speed data (Arctic Ocean): Sample covariance matrix (SCM) (top left), PRLS covariance estimate (top 
right), temporal Kronecker factor for first KP component (bottom left) and spatial Kronecker factor for first KP component 
(bottom right). 
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Fig. 19. NCEP wind speed data (Arctic Ocean): Kronecker spectrum of SCM (left) and Eigenspectrum of SCM (right). The 
first and second KP components contain 91.12% and 3.28% of the spectrum energy. The first and second eigenvectors contain 
47.99% and 19.68% of the spectrum energy. The KP spectrum is more compact than the eigenspectrum. 



Implementation of a standard covariance-based prediction scheme using our Kronecker product estimator 
achieved performance gains as compared to standard with respect to previously proposed covariance-based 
predictors. 

There are several questions that remain open and are worthy of additional study. First, while the 
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Fig. 20. NCEP wind speed data (Arctic Ocean): RMSE prediction performance across q stations for linear estimators using 
SCM (blue) and PRLS (green). PRLS achieves an average reduction in RMSE of 4.64 dB as compared to SCM (averaged across 
stations) and regularized Tyler achieves an average RMSE reduction of 3.41 dB as compared to SCM. 
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Fig. 21. NCEP wind speed data (Arctic Ocean): Prediction performance for linear estimators using SCM (blue) and PRLS 
(green) for a time interval of 150 days. The actual (ground truth) wind speeds are shown in black. PRLS offers better tracking 
performance as compared to SCM. 

proposed penalized least squares Kronecker sum approximation yields a unique solution, the solution 
requires specification of the parameter A, which specifies both the separation rank, and the amount 
of spectral shrinkage in the approximation. It would be worthwhile to investigate optimal or consistent 
methods of choosing this regularization parameter, e.g. using Stein's theory of unbiased risk minimization. 
Second, while we have proven positive definiteness of the Kronecker sum approximation when the 
number of samples is greater than than the variable dimension in our experiments we have observed 
that positive definiteness is preserved more generally. The positive definiteness conditions should be 
investigated further. Finally, maximum likelihood estimation of Kronecker sum covariance and inverse 
covariance matrices is a worthwhile open problem. 
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Appendix A 
Proof of Theorem[T] 

Proof: 

1) Note that the singular value decomposition of R„ = 7^(S„) is equivalent to the solution of the 
minimum norm problem: 

r 

mill ||S„ Afc (8)Bfc|||^ (18) 

k=l 

This problem is solved by a permuted SVD f37|, which can be equivalently formulated as a 
successive rank-1 approximation problem in a different linear space and each rank-1 approximation 
problem can be written as an iterative system |[29l . After some algebra, this sequence of iterative 
systems takes the form: 

Ai = A(Bi) 
Bi = B(Ai) 

A„ = A(B„) - ^ < A„, A(B„) > A„, 2 < n < r 

a=l 

B. = B(A„) - X: ^^^^tI^B,, 2<u<r (19) 

/3=1 I|B/3|If 

for any initial p.d. starting matrix Bu,n > 1. The operators A(-) and B(-) are defined as: 

A(B) = — ^ Yl [B]fc,iS„(/, k) (20) 



\F k,l=l 



B(A) 



1 ^ 

J][A],-,S„(j,i) (21) 



Since the operators A(-) and B(-) inherit symmetry from S„, the proof is complete. 
2) Recall that the SCM S„ is positive definite with probability 1 if n > pq. First, consider the minimum 
norm problem (18 1. From the first part of the proposition, it follows that the factors A^ and B^ 



must be symmetric. If we show that there exists a solution to (18 1 with p.d. Kronecker factors, 
then the weighted sum with positive scalars is also p.d. and as a result, the PRLS solution given 
by = X]fc=i (^'^fc(R'n) ~ Ufc (g) Vfc is positive definite (see Eqn. 6 1. 
Fix / G {1, . . . , r}. We will show that A; and B/ are p.d. matrices. From eigendecomposition, it fol- 
lows that there exist orthonormal matrices and and diagonal matrices = diag{dj, . . . , d^) 
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and Ai = dia.g{Xj, . . . , A?) such that: 



Ai = SfBiSi 



Set = (g) H;. Define = Q^SnQi- The objective ( 18 1 can be rewritten as: 



\Sn-^A, 



B 



k=l 



IQf s. 



k=l 



\Fi-^Qf{Ak0Bk)Q 



2 

iWf 



k=l 



\Fi-^{^fAk^i)(^{SfBkSi 

k^l 



Ml 



; Al^l) 09 {a^ oiai)\\p 

2 



\\Mi--Di^Ai\\l 
\\Mi\\l + \\T>i AiWl - 2tr (Fz(Di A;)) 
+ 2 J] tr((*f Afc*; Hf Bfc3i)(Dz A;)) 

k^l 

l|Mz||^-||Fz||^ + ||F,-D,®A/||^ 
+ 2j]tr(BfcB0tr(CfcQ) 

l|Md|^-||F,||^ + ||Fi-diag(FO||^ 



||diag(Fz) -Bi(g)A, 



iWf 



where we used the orthogonality of Kronecker factors induced by the SVD solution in the last 
step. We note that the term ||M;||^ - ||F;||^ + ||F; - diag(F/)|||, is independent of D/, A^. The 
positive definiteness of S„ implies that the diagonal elements of F; are all positive. Let diag(Fi) = 
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diag({/(i_i)q+j}jj) > 0. Direct algebra yields: 

||diag(F;)-Di® A,||^ 

P g 

i=l j=l 

p q 
i=i j=i 

P Q 

i=i j=i 

We note that the term Yl^=iY^'j=i(.f{i-i)q+j ~ I'^ill'^il)^ invariant to any sign changes of the 
eigenvalues {di,Xj}ij. By contradiction, it follows that the eigenvalues {di} and {A^} must all 
have the same sign (if not, then the minimum norm is not achieved by (A^jB;)). Without loss of 
generality (since Ai^'Bi = (— Ai)(g)(— B;), we assume the sign is positive. We conclude that there 
exist p.d. matrices (Ai,B;) that achieve the minimum norm. Unfixing I and using the convexity 
of the PRLS objective (HI) concludes the proof. 



Appendix B 
Proof of Theorem[2] 

Proof: The proof generalizes Thm. 1 in 1 1 ] to nonsquare matrices. A necessary and sufficient 
condition for the minimizer of ^ is that there exists a V e 9||R^||^ such that: 

< 2(R^ - R„) + AV, - R >< (22) 



for all R. From (22 1, we obtain for any V G (?||R||]^: 



2 < R^ - Ro, R^ - R > +A < V - V, R^ - R > 

< -A < V,R^ - R > +2 < R„ - Ro,R^ -R > (23) 

The monotonicity of subdifferentials of convex functions implies: 

<V-V,R^-R>>0 (24) 

From Example 2 in ll44ll . we have the characterization of the subdifferential of a nuclear norm of a 
nonsquare matrix: 

d\\R\l = i ^u,-(R)v,(R)^ + P^WP^ : ||W||2 < 1 
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where r = rank(R), U = spanjuj} and V = spanjvj}. Thus, for R = X]j=i '7j(R)ujvJ, r 



we can write: 



where W can be chosen such that ||W||2 < 1 and 



<P^WP^,R^-R>= IIP^R^P^I 



Next, note the equaUty: 



R-^ "D 11^ _L IIT?-^ "DIM ll"E? T> 
I — rLQll^ ~r 11-^ — -•^IIf — " — ' 



■oIIf 



2 < R — Rfi, R — R > 



Using (24 1, (26 1 and (27 1 in (23 1, we obtain: 



IR"^ — Roll^ + ||R'*' — R||^ + A||P['^R''*Py||^ 



< ||R-Ro||^ + A< J]]ujvJ,-(R^-R) > 

i=i 

+ 2 < R„ — Rq , R"^ — R > 



From trace duaUty, we have: 



< J]u,vJ,-(R^-R) > 



=< Pu "ivJPv, -(R^ - R) > 

r 

< |lE^^-^Jll2llP?^(R-'-R)PvlL 



|Pt/(R^-R)Py| 



where we used the symmetry of projection matrices. Using this bound in (28 1, we obtain: 



< ||R - Roll^ + A||Pt/(R^ - R)Py I 



+ 2 < A„,R-^ - R > 
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where A„ = R„ — Rq. Define the orthogonal projection of R onto the outer product span of U and V 
as Vu,v(R) = R — Pf'^RPy. Then, we decompose: 

< A„,R^-R>=< An,Pc/v (r^-r) > 



+ < A„,PcL(R^-R)P^> 
By the Cauchy-Schwarz inequality and trace-duality: 



\\Pu(Fi - R)PyL < Vrank(R)||R^ - R||^ 
I < An,Vu,v{n^ - R) > I < WAnUVuyi-R^ - R)L 

< ||A„||2 V'2rank(R)||R^ - R|| 
< A„,P^(R^ - R)P^ > I < ||A„||2||P^R^P^||, 



where we used Pt^RP^ = 0. Using these bounds in (29 1, we further obtain: 



|R^ - Rolll. + ||R^ - R|||. + (A - 2|| A„||2)||Pf^R^P 



v\ 



< ||R - Roll^ + ((2^/2|| A„||2 + A) V^)(a/ ||R^ - R|If) (30) 



Using the arithmetic-mean geometric-mean inequality in the RHS of (30 1 and the assumption A > 
2||A„||2, we obtain: 



IR'^ — R-olli^ < IIR- — R-oIIf H ^ — . r 



This concludes the proof. 



Appendix C 
LemmaE] 

Lemma 2. (Concentration of Measure for Coupled Gaussian Chaos) Let X and Y be arbitrary unit- 
Frobenius norm matrices and let x G W'^ and y € M^^ be reshaped versions of X and Y. In the SCM 
(|2]) assume that {z^} are i.i.d. multivariate normal ~ Sq). Recall A„ in ([9]). For all r > 0; 

P(|x^A„y| > r) < 2exp f ) (31) 

where Ci = -A= ~ 2.5044 and C2 = e\/2 ~ 3.8442 are absolute constants. 

Von 
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Proof: This proof is based on large deviation theory for Gaussian matrices. Note that by the definition 
of the reshaping permutation operator 7^(-), we have: 

rvec(zt(l)zj(l)^)^ - E[vec(zt(l)zt(l)^)^] 

1 " 
n 

[vec(zt(p)zt(p)^)^ - E[vec(zt(p)zj(p)^)^] 
where Zf(i) = [zf](j_i)g^i.j^ is the ith subvector of tth observation zt. Thus, we can write: 

1 " 

t=i 

where 

i j=l A:,«=l 

X ([Zt](i-l)g+fc[zt](j_l)g+i - IE[[zt](j„i),+fc[zt](j-_l)q+/]) (32) 

and X G Rp^p and Y € W'^'^ are reshaped versions of x and y. Defining M = X (g) Y, we can write 



(32 1 as: 



V^f = zf Mzt - E[zf Mzt] 



The statistic (32) has the form of Gaussian chaos of order 2. Many of the random variables involved in 



the summation (32 1 are correlated, which makes the analysis difficult. To simplify the concentration of 
measure derivation, using the joint Gaussian property of the data, we note that the stochastic equivalent 
of zJ'Mzt is /3fM/3(, where M = Sy^MEg^^, where /3j ~ N{0,lpq) is a random vector with i.i.d. 
standard normal components. By this decoupling argument, we have: 

E|V^i|2 = E|/3fM/3,-E[/3fM/3,]f 



E 



11^12 «1 = 1 



ii^i2 ii 

||M||^ + ||diag(M)||^ 

|2||AT||2 



< 2||M||^ < 2||5:o||2||M||^ = 2||S;o||2 
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where in the last step we used ||M||^ = ||X||p,|| Y||^ = 1. 
It can be shown (see Appendix A in [45 1) that for all m > 2: 

ElV'tl™ < mW'^-'^vtl'l (33) 

where 

W = e^yE\iPt\^ < e\/2||I], 



0II2 



From Bernstein's inequality (see Thm. 1.1 in P31 '). we obtain: 



- V > r < 2 exp — 

n \nvi + WriT J 

-nrV2 



< 2exp 



Cl||^o|l2 + C'2||So||2''" 



This concludes the proof. 



Appendix D 
Proof of Theorem[3] 

Proof: Let J\f{S'^'~^ ,e') denote an e'-net on the sphere S'^'^^. Let xi € S^"^^ and yi G 5'?'"^ be 
such that I < xi, A„yi > | = ||A„||2. Let ||xi— X2II2 < e' and ||yi — y2||2 ^ e'- By the Cauchy-Schwarz 
inequality, for X2 G AA(5p'-\ e'), y2 e M{S'i"'^ ,e'): 

|xf A„yi| - |x2 A„y2| < |xf A„yi -x^A„y2| 
= I < xi, A„(yi - y2) > + < xi - X2, A„y2 > | 
< 2e'||A, 



'n||2 



This implies: 



|A„||2 < (l-2e')"' max |x^ A„y| (34) 

xeAr(5P=-i,e')>y6A^('S9'-i,e') 



From Lemma 5.2 in B6ll . we have the bound on the cardinality of the e'-net on the unit Euclidean sphere 

c3sd{N{S'^'-\e'))<{l + j}j (35) 
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From ([34]|, (|35]l and the union bound: 



^nlU > e) 



|x^A„y| > e(l - 2e')) 



< r I max 



< P I U |x^A„y| > e(l - 2e') 

\xeAfiSP^-\e'),y€Af{Si^-Ke') 

X max P(|x'^A„y| > e(l - 2e')) 



|x^A„y| > e(l - 2e')) 



Using Lemma [2j we further obtain: 

l^nlla > e) 



<2(1 + - 



exp 



-ne\l - 2e'y/2 



(36) 



^Ci||I]o||^ + C2||So||2e(l-2e')y 
We finish the proof by considering the two separate regimes. First, let us consider the Gaussian tail 
regime which occurs when n > (^)^(p^ + + logM) and choose: 



l-2e' 



n 



For this regime, the bound (36 1 can be relaxed to: 



< 2 1 + 



Then, from (37 1, we have: 



Anils > 



< 2 1 + 



V 2C1IIS0II2 



|5^0|l2. /p' + ^' + logA/ 



(37) 



1 - 2e' 



exp 



n 

t^{p^ + + logM) 



4Ci 



<2 1 + 



< 2M-*'/(4^i) 
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This concludes the bound for the first regime. Second, the exponential tail regime follows by similar 

tC2 



arguments. Assuming n < ^(p^ + + logM), we obtain from (36 1 



t||So||2ff^ + g^ + logM 
2^^+"^ ( -t{p^+q^+\ogM) 

— ^2 — 



e' 

where we used the assumption t > 46*2 ln(l + The proof is complete by combining both regimes 
and taking Co > large enough and noting that t> \. ■ 

Appendix E 
Proof of Theorem|4] 

Proof: Define the event 

ii;, = |||R^-Ro||^> inf llR-Roll^ + ^i^/^A^r 

R:rank(R)<r 4 

where A„ is chosen as in the statement of the theorem. 

Theorem [2] implies that on the event A > 2||A„||2, with probability 1, we have for any 1 < r < ro: 

||R^-Ro||p< inf ||R-Rofp+ (^+/^^^ aV 

" " - R:rank(R)<r" 4 

Using this and Theorem |3] we obtain: 

= ¥{Er n {A„ > 2|| AnIIJ) + ¥{Er n {A„ < 2|| AnIIJ) 
< + P(A„ < 2||A„||2) 
A II >^ 



X max • 



+ + log M + q^ + log M 



n \ n 



< 2M"*/^*^ 

This concludes the proof. 

^Note that this constant depends on the constants t, Ci , C2 > 0. 
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Appendix F 
Proof of Lemma[T] 

Proof: From the min-max theorem of Courant- Fischer- Weyl |47]: 

a^+i(R) = Afe+i(RR^) 

= min max < RR"^v, v > 

V:dim(V^)<k \\^f\\^=l,-v(^V 

Define the set Vfc = {v G W'" : ||v||2 = 1, v _L Col(RPfcR^)} C Sp'-\ Choosing V = Col(RPfcR'^)-L, 
we have the upper bound: 

(t| , i(R) < max < RR^v, v > 
Using the definition of Vk and the orthogonaUty principle, we have: 

< RR^v, V > =< R(I - Pfc)R^v, V > 
=< (I - Pfc)R^v, R^v > 
=< (I - Pfc)R^v, (I - Pfc)R^v > 
= ||(I-Pfc)R^v||2 

Using this equality and the definition of the spectral norm BTl : 

a2^i(R) < max 11(1- Pfc)R^v||2 

< max 11(1 - Pfc)R^v||2 

= ||(I-Pfe)R^||^ 

Equality follows when choosing P^ = V^V^. This is seen by writing I = VV-^ and using the definition 
of the spectral norm and the sorting of the singular values. The proof is complete. 

■ 

Appendix G 
Proof of Theorem[5] 

Proof: Note that (A, u) is an eigenvalue-eigenvector pair of the square symmetric matrix Rg Rq if: 

^vec(I]o(«,i)) < u,vec(5]o(i,i)) >= Au (38) 

So for A > 0, the eigenvector u must lie in the span of the vectorized submatrices {vec(Xlo(i, 
Motivated by this result, we use the Gram-Schmidt procedure to construct a basis that incrementally 
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spans more and more of the subspace span({vec(So(i,i))}jj). For the special case of the block-ToepUtz 
matrix, we have: 

span({vec(So(i, = span({vec(S(r))}f=_;v) 

By stationarity, note that X;(-r) = X;(r)'^. 

For simplicity, consider the case k = 2k' + 1 for some k' > 0. From Lemma [T] we are free to choose 
an orthonormal basis set {vi, . . . , v^} and form the projection matrix = V^V^, where the columns 
of Vjk are the vectors {vj}. We form the orthonormal basis using the Gram-Schmidt procedure BTl : 

vo = vec(S(0)), 
vo 



vo 



^^^^\\ < vec(S(l)),vo 
vi = vec(S(l)) — "2 Vo, 

IIV0II2 

Vl 

IIV1II2 

f^f 1^^ < vec(5](-l)),vo 
v_i = vec(S(-l)) — "2 VQ 

IIV0II2 

< vec(S(-l)),vi >^ 

II- ii2 '^l' 

llvilla 

V-l 



V-l 

IIV-1II2 

etc. 



With this choice of orthonormal basis, it follows that for every k = 2k' + 1, we have the orthogonal 
projector: 

Pfc = VQVo + ^(vivf + v_iv!^ 



k' 

1=1 
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This corresponds to a variant of a sequence of Householder transformations BTl . Using Lemma [T] 

ai+i(Ro)< ||Ro(I-Pa.)II2 

< llRn — RnPi 



N 

<pY1 wmwl + m- 



(39) 



l=k' + l 



N 



< 2C'pq u 

l=k' + l 
,2k'+2 



21 



< 2C'pq 

< 2C'pq 



u 



l-v? 

k 



U 



l-V? 



where we used Lemma |3] to obtain (|39]). To finish the proof, using the bound above and ( [T2] ): 



ro-l 

R:rank(Rj<r — 

k=r 



l-n2 
< 2C'pq 



k=r 



{l-uf 



The proof is complete. 



Appendix H 
Lemma [3] 

Lemma 3. Consider the notation and setting of proof of Thm. |5] Then, for the projection matrix 
chosen, we have for k = 2k' + 1, k' > 1: 



N 



a2^i(Ro) < ||Ro-RoPfc||^ <P Yl II^(OIIf + II^(- 

l=k' + l 



Proof: To illustrate the row-subtraction technique, we consider the simplified scenario k' = I. The 
proof can be easily generalized to all k' > I. Without loss of generality, we write the permuted covariance 

I](0) 

S(-l) 5](0) 



(40) 



March 1, 2013 



DRAFT 



35 



Ro = R(So) 



as: 

" vec(5](0))^ 
vec(S(l))^ 
vec(5](-l))^ 
vec(5](0))^ 

Using the Gram-Schmidt submatrix basis construction of the proof of Thm. [5j the sequence of projection 
matrices can be written as: 

Pi = VqV^ 



P2 

Pa 



VoVq + vivf 

V0V(f + vivf + V_iV^i 



where Vj is the orthonormal basis constructed in the proof of Thm.js] The singular value bound (T^(Ro) < 
llRoll^ = 2||S(0)||^ + ||S(1)||^ + m-ml is trivial m. 
For the second singular value, we want to prove the bound: 

,2/-o \ / ll%n/i\l|2 



a2^(Ro)<||S(l)||^ + ||S(-l)||^ 
To show this, we use the variational bound of Lemma [TJ 



(41) 



o"|(Ro) < ||Ro — RoPilliT' 

vec(5](0))^- 
vec(S;(l))^- 
vec(S(-l))^- 
vec(S(0))^- 



vec(I](l))^- 
vec(S(-l))^- 



< vec(5](0)),vo > v|f 

< vec(S(l)),vo > vjf 

< vec(5](-l)),vo > vjf 

< vec(i:(0)),vo > v|f 

0^ 

< vec(S(l)),vo > v;[ 

< vec(S(-l)),vo > 







T 



|vec(5:(l))- < vec(S(l)),vo > vo||; 
f ||vec(5](-l))- < vec(S(-l)),vo > vo| 



1^ 
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where in the last step, we used the Pythagorean principle from least-squares theory B8t -i.e.|| A 

|2 / II A l|2 



<A,B> 
liBlil 



B||^ < II A 11^ for any matrices A,B of the same order. Next, we want to show 



ai(Ro)<||S(-l)||^ 



(42) 



Define = vec(I](j))— < vec(I](j)), vq > vq. Using similar bounds and the above, after some 
algebra: 



^li^o) < ||Ro — RoP2|li? 



0^ 



7(1)^- < vec(S(l)),vi > vf 
7(-l)^- <vec(5](-l)),vi > vf 
0^ 

= ||vec(S(-l))^- < vec(5:(-l)), vo > 

- < vec(5](-l)),vi > vf||2 
= ||vec(5](-l))^||2 - I < vec(S(-l)), vo > ^ 

-I < vec(5](-l)),vi > |2 
<I|S(-1)||^ 

where we observed that 7(1) =< vec(S(l)), vi > vi and used the Pythagorean principle again. 

Using P3 and similar bounds, it follows that c7|(Ro) = 0, which makes sense since the separation 



rank of (40l is at most 3. Generalizing to A;' > 1 and noting that ||So||^ = p||S(0)||^ + X]f=i (P 

|2 



+ IIS 



^<p||s(o)||^+pEf=i lis 



1^, we conclude the proof. 
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